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ZERO OVERHEAD SELF-TIMED ITERATIVE LOGIC 

FTET.n OF THE INVENTION 
This invention relates to digital electronic 
circuits, and, more particularly, to self -timed circuits, 
5 including iterative division algorithms. The design 
technique of the present patent is called "Zero-Overhead 
Self -Timed Iterative Logic," abbreviated ZOSTIL. 

BACKGROU WP OF THE INVENTION 
The timing performance of any system can be judged by 

10 one of two measures: latency or throughput. The delay 
from an input to the resulting output is called the 
latency, and most real world problems desire this delay to 
be minimized. If a system can have several computations 
in progress at once, then the minimum delay between two 

15 successive inputs determines the throughput, which is the 
maximum data rate at which the system can accept requests 
for computation. Performance assessed by either of these 
measures depends on the sum of the raw propagation delay 
through the combinational logic of the desired function 

20 plus "other" overhead delays. From a theoretical point of 
view, the fastest circuit would eliminate all overheads 
and have circuit delays due to only the raw combinational 
logic. The innovations in this patent reduce the latency 
overhead in a pipeline to zero. Hence, the ZOSTIL innova- 

25 tion will produce functions whose latency attains the 
theoretical lower bound, but without requiring the large 
and costly area of a full combinational array. 
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Traditional synchronous circuit design techniques 
separate combinational logic from data storage. That is, 
storage is provided by explicit latches interposed between 
sections of combinational logic. This design technique 
5 has at least four sources of overhead which increase 
circuit latency: 1) propagation delay through latches; 
2) margin added to tolerate clock skew; 3) wasted time in 
fast stages within the system; 4) maximizing data- 
dependent delay; and 5) the assumption of worst case 
10 timing of components. 

The first source of latency overhead is due to 
latches because they introduce additional delays due to 
their set-up time and propagation delays. The minimum 
cycle time of a synchronous circuit is the sum of the 

15 latch set-up time, latch propagation delay, and maximum 
combinational logic delay. The first innovation in the 
' ZOSTIL methodology is remove this overhead completely by 
removing the explicit latches altogether and making use of 
the "free" half -latch at the output of each stage in a 

20 CMOS domino chain. 

The second source of latency overhead comes from 
needing to distribute the clock to all latches in the 
system. Communicating stages must be in agreement as to 
when the clock edges occur, but wire or driver delays 

25 cause clock skew which must be compensated for by adding 
some margin to the total clock period. This added margin 
is also overhead. Previous asynchronous design techniques 
used handshaking blocks to remove global clocks and the 
extra latency overhead due to clock skew by communicating 

30 data validity locally instead of globally. But these 
previous techniques include explicit latches, and hence, 
still had the latency overhead due to latch propagation 
delays. Previous techniques also added some overhead due 
to the forward directed paths within the handshaking 

35 logic. The second ZOSTIL innovation is to insure all 
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control paths operate in parallel with the forward evalua- 
tion rather than adding sequentially to the path. 

The third source of latency overhead is due to 
mismatching of the functional sections between the 
5 latches. Because the amount of time in a clock period is 
fixed, it must be set equal to the longest propagation 
delay of all of the different functional sections in the 
system. The difference between that maximum and the 
actual time used by any functional section is overhead 
10 because it is wasted time. A self -timed dataflow does not 
waste this time because it allows data to flow forward 
based on data-driven local control, rather than waiting 
for clock edges. Although the throughput of a pipeline is 
.still limited by its slowest stage, the latency is 
15 improved by letting each stage progress as soon as it can. 

The fourth source of latency overhead comes from 
determining critical paths in synchronous logic based on 
the worst-case data values. If there is a large variance 
then there is a large performance loss due to the dif fer- 

20 ence between the average and maximum values of delay. 
Synchronous designers try to adjust transistor sizing to 
• equalize the various paths through a body of logic, but in 
self-timed systems it is desired to minimize the proba- 
bilistic expected value of the delay rather than minimiz- 

25 ing the maximum delay. The third innovation of this 
patent is to make use of any known probabilistic distribu- 
tion of the inputs of each block of logic in order to size 
the transistors in that block to minimize the expected 
value of the total delay. 

3 0 The fifth source of latency overhead is the derating 

used to insure performance over a range of temperature and 
voltage levels. Synchronous system design must always be 
based on conservative derated "worst-case" specifications 
because the system must work at the environmental 
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extremes. But when the actual conditions are not at the 
extremes, the difference between the possible performance 
and the actual designed performance is wasted performance. 
Self-timed components will always run at their maximum 
5 speed for the existing conditions and deliver their 
outputs as soon as they are actually finished. By provid- 
ing completion indication, they allow an enclosing system 
to make use of the output sooner than always waiting for 
the worst case. 

10 Background and Nomencla t ure for Dual-Monotonic Signals 
If A is a dual-monotonic signal, it is be represented 
by two "sub-signals", called A 0 and A 1 , with the encoding: 
if both of the wires are in the same logical state, say 
low, then the signal A has not yet evaluated; if either A 0 

15 or A 1 changes state, this communicates the signal A has 
finished evaluating, and the state of A is determined by 
noting which of the two wires changed. For example, if 
both A 0 and A 1 have the binary value '0', then the value of 
the signal, A, is not yet determined. If A 1 transitions to 

20 '1', then the value of A is '1', while if A°transitions to 
'1', then the value of A is '0'. The pair of wires is 
called a dual-monotonic pair because the transitions on 
the wires must be monotonic during evaluation. These 
transitions are mutually exclusive, and either one indi- 

25 cates the evaluation is complete and can be used by other' 
circuits. In this patent, signal names are italicized, 
and a "*" is used to indicate logical inversion. Also, 
each half of a dual-monotonic signal will have a super- 
script of 1 or 0. 

30 Backgrou nd on Domino Logic 

Monotonic signals can be conveniently generated by 
CMOS domino logic. Each signal can be in one of three 
functional phases: 1) precharge or reset, 2) logic 
evaluation, or 3) data storage. These three phases are 

35 shown in Figure 1, which shows a two-input dual-monotonic 
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AND gate and its waveform diagram. During the reset 
phase, the active low precharge signal, P*, is active and 
the A and 8 signals must be inactive. This causes the 
precharged nodes X* and Y* to be high, and the Q outputs, 
5 to be low. In the logic evaluation phase, either A 0 or A 1 
and either B° or B 1 will transition high monotonically. If 
both A 1 and B 1 transition high, the AND gate's Q 1 output 
monotonically transitions high, and if either A 0 and B° go 
active, the Q° output will go high. During the data 

10 storage phase, both A and 8 signals are forced low, and P* 
remains inactive. This condition leaves the precharged 
nodes X* and Y* undriven, and capacitance causes them to 
act as a memory element so the outputs, Q 1 and Q°, remain 
in the same state as they were during the logic evaluation 

15 phase. Thus, each domino stage includes a "free" half- 
latch because no additional transistors and no additional 
logic delays are needed to store data. 

overview of the Innovations 
CMOS domino logic is normally used only in two 

20 phases: precharge and logic evaluation. The invention of 
the present patent uses a third phase to store data, which 
allows domino logic gates to be cascaded and pipelined 
without intervening latches. The inputs to this system 
must have strictly monotonic transitions during the logic 

25 evaluation phase and the precharge signal must be active 
during only the precharge phase. Furthermore, the pipe- 
lined system can feed its output back to the input to form 
an iterative structure. Such a feedback pipeline is 
viewed as a "loop" or "ring" of logic which circulates 

30 data until the entire computation is complete. 

The innovation of making use of the temporary storage 
of a precharged function block allows the explicit latches 
to be omitted. Each domino stage provides the operation 
of a half -latch for free. The Reset Control logic oper- 
35 ates completely in parallel with the function block 
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evaluation. Completion detection logic in each Reset 
Control block observes the output of the following Func- 
tion Block to determine when all of its outputs have 
finished evaluating and then instructs its own Function 
5 Block to move from the data storage phase to the precharge 
phase, driving all its outputs to the reset state. When 
the outputs of the following Function Block subsequently 
become reset, the Reset Control turns off the precharge 
signal for its Function Block, causing it to be ready for 

10 the data evaluation phase when its next data input actu- 
ally arrives. By encoding the data in dual-monotonic 
pairs, there is no forward handshake required and thus the 
control logic is removed from the critical path of the 
circuit. This innovative methodology, in conjunction with 

15 the first innovation removing the need for explicit 
latches, yields a truly zero overhead minimum latency 
delay path through pipelined, logic. 

The ZOSTIL technique includes combining the latch- 
free circuits and parallel Reset Control into an iterative 
20 structure, or "Ring." This is particularly important for 
arithmetic operations which perform the same basic func- 
tion over and over. Examples of these type of functions 
are: multiplication, division, square root, sine, and 
cosine. 

25 ZOSTIL circuits are robust because, with proper 

design of the control logic, they are delay-independent. 
That is, the circuits will function correctly regardless 
of the actual delays of the circuit elements. Therefore, 
calculations involving delays are not necessary to insure 

30 the logical correctness or functionality of the system, 
but are used only to estimate the performance. This 
contrasts to synchronous design techniques which require 
extensive delay calculations to insure all computations 
within a single logic stage can be performed in one clock 

35 cycle. Improper delay estimation may result in a synch- 
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ronous circuit which does not always produce the correct 
result. 

Division algorithms generate a quotient by successive 
determination of quotient digits from most significant to 
5 least significant. Because each quotient digit is used in 
the computation of the next partial remainder, which in 
turn is required to determine the next quotient digit, 
division is an inherently sequential process. Hence, a 
pipelined ring designed with the ZOSTIL technique is ideal 

10 for performing arithmetic division. An additional innova- 
tion specific to division is to overlap and interlock 
stages to allow two remainder computations to occur in 
parallel. This is accomplished by modifying an algorithm, 
known as SRT division, to perform several small remainder 

15 computations in parallel and choose the correct remainder 
when the quotient digit from the previous stage is deter- 
mined. This innovation improves the overall latency by a 
factor of two in comparison with the previous algorithms. 

BRIEF DESCRIPTION OF THE DRAWINGS 
20 The present invention will be better understood by 

reviewing this description given with reference to the 
following drawings: 

Figure 1: Self -Timed Domino Looic AND Gate . This is 
a schematic of a two-input dual-monotonic self-timed AND 

25 gate constructed in CMOS technology; 

Figure 2: Precharaed Func tion Blocks. This is a 
linear pipeline of precharged function blocks which also 
includes logic for completion detection, which is used to 
reset the function blocks; 

30 Figure 3: Datapaths Merging . This is a schematic 

showing the Control Reset Logic needed to merge two self- 
timed pipelines; 
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Figure 4: natanaths Splitting. This is a schematic 
showing the control Reset logic needed to split a self- 
timed pipeline; 

Figure 5: ^proving Expected Tota l Delay. This 
5 shows how changing the circuit topology without changing 
the function can improve the expected circuit performance; 

Figure 6: t^emc for Four Stage Sf>1f-Tjmed Ring. 
This is a four stage pipeline ring of precharged function 
blocks ; 

10 Figure 7: np.nendenc y Graph for Four Stag* Self-Timed 

R i nq . This is dependency graph of the schematic shown in 
Figure 6; 

Figure 8: sequenti a l Data-flow in Ordinary Radix 2 
g pT Division . This is schematic showing the dataflow in 
15 one stage of previously described SRT division algorithm; 
and 

Figure 9 : Tntra-sta p p overlapped Execution of Radix 
g spt Division . This is a schematic showing the improve- 
ment in SRT division. 

20 D FT AILED PF SCRTPTION OF THE INNOVATIONS 

This patent develops innovations in asynchronous 
circuit design technique leading to "Zero-Overhead Self- 
Timed Iterative Logic," abbreviated ZOSTIL. 

fin* 7.QSTII . Technique 

25 Asynchronous circuits have the potential for avoiding 

the latency overheads of synchronous circuits. By commu- 
nicating completion status along with the data, each 
processing element can begin to operate on data as soon as 
it arrives without waiting for re-synchronization to a 

30 global clock at every latch. 

Previous implementations of asynchronous logic used 
explicit latches to store intermediate results. These 
latches introduce additional propagation delay to the 
circuit's critical path, but do not directly contribute to 
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the computational function. The first innovation of this 
patent is to avoid explicit latches entirely by using CMOS 
domino function blocks as "free" half-latches. This is 
possible only if the control for the function block pre- 
5 charge makes certain the outputs from a function block 
have been utilized by all subsequent stages before reset- 
ting a function block and destroying the data. In order 
to determine when succeeding function blocks are finished 
using the data, it is necessary to construct a completion 
10 detector. 

A simple OR gate connected to the two wires of each 
dual-monotonic pair output provides a done indicator for 
each individual signal output from a logic stage. The 
. stage is considered done computing when all of its data 

15 outputs are individually done. A tree of last-of gates, 
commonly called C-elements, can be used as the completion 
detector to determine when all of the bits in a datapath 
have changed. Each C-element has the property that its 
output is that of the inputs when they were last the same. 

20 The output of a tree of C-elements will indicate done when 
all of the inputs are done and the output will indicate 
reset when all of the inputs have reset. 

Once the completion signals are generated, they must 
be used to provide the control for resetting the pre- 

25 charged blocks appropriately. Previous self -timed 

circuits required both forward and backward handshakes, 
but the second innovation of this patent is to completely 
embed the completion indication of the forward data in 
dual-monotonic pairs and to eliminate the forward hand- 

3 0 shake. Further, the backward handshake can be designed so 
that it does not affect the forward critical path. For a 
simple unidirectional data flow, this control is a 
sequence of backward pointing inverters as shown in 
Figure 2. For merging and splitting datapaths, the con- 

35 trol is as shown in Figures 3 and 4. None of these con- 
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trol circuits are in the direct path of the forward flow- 
ing data and hence they do not add to the latency of the 
forward flowing wave of data. 

The control logic resets each function block before 
5 the next wave of data comes along and causes the function 
block to evaluate its outputs again. As long as the data 
waves are spaced apart, each data wave will propagate with 
latency equal only to the pure combinational delay, 
without any additional overhead. 

10 If a problem requires repetitive execution of a 

logical function, then it is particularly appropriate to 
build a ring of precharged function blocks. The function 
blocks can be the same, or they may implement different 
functions. The width of the datapath between stages of 

15 the ring need not be constant. The data in a self -timed 
ring loops around the ring at the same speed as it could 
progress through a large combinational array, but the 
silicon area of the circuit is much reduce A physical 
analogy to this is a circle of dominoes. The trick is to 

2 0 make the wave of falling dominoes progress around the 

circle continuously at the same speed as it would down a 
long row of dominoes, and this is accomplished by standing 
each domino back up after its successor has fallen. 

Usually a linear pipeline is judged by the throughput 
25 of the stages. But when the stages are connected into a 
ring to solve a single iterative problem, the time it 
takes to compute the answer is dependent on the latency 
through the stages. So having low latency is the 
important property for the stages in a loop rather than 

3 0 throughput. Zero overhead control logic reduces the 

latency to the lower bound of strictly the combinational 
delay of the function blocks. 

Dependency Graphs Verifying Zero Overhead 
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The ZOSTIL technique results in circuits which func- 
tion correctly independent of the actual delays taken by 
each of the blocks. The designs are thus robust since 
changes in the delays will not affect the logical opera- 
5 tion of the circuit. But, the actual delays determine the 
overall performance, and relative delays determine which 
path through the circuit is the limiting, or critical, 
path. The objective of "Zero-overhead" design is to make 
sure that the performance is limited only by the function 

10 blocks comprising the desired combinational logic, and 
hence that the critical path under nominal relative delays 
does not go through any control blocks. If the perform- 
ance analysis shows the circuit does not achieve zero 
overhead, then the schematic can be modified by making the 

15 stages more finely grained until zero overhead latency is 
achieved. 

To insure a design has no overhead due to control, 
dependency graphs are drawn to illustrate all possible 
critical paths. A simple ring and its dependency graph 

2 0 are shown in Figures 6 and 7. The critical cycle time of 

data flowing around the ring will be the longest cyclic 
path in the graph. The graph is really a restricted 
Petri-net and the firing rules are the same; a node is 
marked when all of its predecessors have been marked. 

25 Adjusting Logic and Transistor Sizing 

Based on Data P robabilities . — 

Since self-timed circuits accompany data with comple- 
tion signals, subsequent computations may begin as soon as 
each data arrives. Since it is not required that process- 

3 0 ing times are the same, what is really desired is that the 

total expected value of delay is minimized. In cases 
where data values are distributed with equal probability, 
the expected value is, of course, minimized when the 
average delay is minimized. However, in some cases, the 
3 5 designer may know that data values will have a particular 
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distribution, and this information can be used to minimize 
the total expected value to be better than just the 
average of all data value delays. 

Paths that are known to have higher than average 
5 usage can be made faster by shortening the number of logic 
blocks they contain or by widening the transistors so that 
the blocks go faster. For example in Figure 5, a net 
improvement in the expected value of delay will result if, 
by inverting both arms of a multiplexor, an invertor is 

10 removed from the arm known to be more frequently chosen, 
even though this results in an invertor being added to the 
other arm. Likewise, if some output of a block must be 
loaded with transistors beginning two different paths, 
-narrowing the transistors in the infrequently chosen path 

15 will slow that path, but will result in an overall 
improvement in expected delay because the output node 
which was also part of the frequently chosen path will be 
faster due to less loading. 

Sftlf-timed SRT Division 

20 Performing division requires making a choice of 

quotient digits starting with the roost significant, and 
progressing to the least significant, digits. The quo- 
tient digit decision is made as a part of each iteration 
which recomputes the next partial remainder based on the 

25 previous partial remainder and quotient digit. Between 
each iteration, the partial remainder is shifted left by 
the base, or radix r, of the digits being used. Each 
iteration thus implements 

Ri +1 = rR, - Dq ; 

30 where Rj is the partial remainder output from stage i, r is 
the radix, q ; is the quotient digit determined from stage, 
D is the Divisor, and the sequence is initialized with 
rR; = the Dividend. 
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in ordinary division, the quotient digits qj are in 

the set {0, ,r-l}, and the full quotient has only a 

single valid representation since each digit position in 
the quotient has only a single correct representation. 
5 Unfortunately, determining the correct digit at each 
position requires comparison of the exact partial remain- 
der, and this means the entire partial remainder must be 
computed before determining each quotient digit. This 
computation requires a complete carry-propagate subtract 
10 to generate the partial remainder before each quotient 
digit may be selected. 

One published algorithm for division is know as the 
SRT algorithm The key idea of SRT division is to avoid a 
complete carry propagation in each iteration by making the 
15 set of valid quotient digits redundant by including both 

positive and negative integers in the set {p, . . . ,0, ,p}. 

The range of quotient digits roust have 

-§ * P * r-l . 



With redundant quotient digit sets, the final quotient 
result can be represented in several different ways, 

20 giving a choice of quotient digits for each position. Any 
valid representation can always, of course, be converted 
to the desired irredundant representation by subtracting 
the positionally weighted negative quotient digits from 
the positionally weighted positive digits. This subtrac- 

25 tion requires a carry propagation, but it is a single 
operation which needs only to be performed once for the 
whole division operation rather than once per stage. 
Further, in an integrated floating-point chip, this full- 
length carry-propagate operation could be performed by 

3 0 shipping the quotient results to a separate part of the 
chip implementing fast carry- look-ahead addition. 
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Since, in SRT division, the quotient set contains 
digits of both signs, the quotient selection logic for a 
given position need only use an approximation of the 
divisor and partial remainder. This is because small 
5 errors may be corrected at a later stage with less signi- 
ficant quotient digits of the opposite sign. Because only 
an approximation of the partial remainder is required at 
each stage for the selection of quotient digits, only a 
small number of the most significant bits of the partial 
10 remainder need to be examined. 

The simplest form of SRT division is to use radix r=2 
with only three quotient digits: +1,0,-1. This requires 
looking at only the top four bits of remainder at each 
stage in order to make the correct quotient digit selec- 
15 tion. The ordinary sequential data-flow for each stage of 
this algorithm is shown in Figure 8. In discrete imple- 
mentations, higher radices such as r=4 and r=l6 have 
routinely been used. 

The probabilistic distribution of quotient digits is 
20 not uniform due to the numerical properties of SRT divi- 
sion. In the radix 2 case, the three quotient digits have 
probabilities of 42%, 35%, and 23%; and 4% of the time it 
is even possible to predict two quotient digits in 
advance. The sign bit of the internal partial remainders 
25 has a 77% probability of being on, even for uniformly 
distributed input operands. These statistics are used to 
speed the more frequently used circuit paths since the 
self-timed implementation which can take advantage of the 
improvement . 

30 Intra-staae Overlapped Execution Innovation 

Prior to this innovation, the steps of the SRT 
division algorithm have been regarded as being purely 
sequential. In this patent, the steps within each stage 
of the algorithm are overlapped which makes it faster by 
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allowing additional parallelism. The data flow for this 
innovation is shown in Figure 9. Specifically, the 
partial 4 -bit carry -save and carry-propagate adders for 
the remainder formation in each stage can operate in 
5 parallel with the previous quotient digit selection and 
the stages own divisor multiple multiplexor and 54-bit 
carry-save adder. One of the inputs to the partial adders 
used to be the chosen divisor multiple from the previous 
stage, which required knowing the selected quotient digit. 

10 But if the partial adders operate in parallel, then the 
quotient digit is not yet determined. Instead, the 
innovation is to duplicate the partial adders for each of 
the possible quotient digits, allowing them to begin 
computation earlier and in parallel, and then choose 

15 . between their results when the quotient digit from the 
previous stage catches up. Since there are three possible 
quotient digits, there needs to be a path for each possi- 
bility. Fortunately, since one of the quotient digits is 
zero, there need be only two partial carry-save adders. 

20 This innovation trims the average propagation delay per 
stage by approximately one-half because the delay is 
dominated by the carry-propagate adder, and the intra - 
stage overlapped execution allows the carry-propagate 
■ additions in two successive stages to be executing simul- 

25 taneously. 

The innovation of intra-stage overlapped execution of 
SRT division can be combined with the ZOSTIL innovation by 
self -timing a sequence of stages, each having the data 
flow of Figure 9. This requires using the merge and join 
30 constructs presented in Figures 3 and 4. A loop of four 
of these stages will repetitively operate as fast as if 
the logic for the stages were assembled into a prohibi- 
tively large combinational array. 
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wpAT rS CIATMED IS: 

1. A zero-overhead self -timed iterative logic 
circuit utilizing CMOS domino function blocks incorporat- 
ing a completion detector associated with each said 
function block for determining that succeeding function 

5 blocks are finished using the data from said function 
block, a function block precharge means being responsive 
to said completion detector to precharge said function 
block only when the outputs have been utilized by succeed- 
ing blocks, whereupon the function block is reset by said 
10 precharge means and the data therein is destroyed. 

2. Pipelines or rings that cn keep data tokens 
separated as they flow through series of precharged stages 
without explicit latches. 

3. Logic that iterates with zero-overhead over the 
raw combinational latencies of its function blocks. 

4. The folded dependency graph analysis method used 
to analyze critical paths in iterating asynchronous 
circuits. 

5. A methodology for designing pipelines or rings 
of function blocks that continuously propagate data 
forward at the speed of a combinational array. 

6. Asynchronous rings that "wrap" a combinational 
array onto a small set of stages, achieving the perform- 
ance of the array without its area. 

7. A self -timed integrated circuit for division 
which iterates at the speed of a combinational array, 
while reguiring the instanciation of only a few stages. 
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AMENDED CLAIMS 

[received by the International Bureau on 24 March 1992 24.03.92) 
original claims 1-7 replaced by amended claims 1-41 (12 pagesJJ 



1. Apparatus for logically evaluating signals and 
providing evaluated signals, comprising: 

a first output terminal, said first output terminal 
operating in a reset phase, a evaluation phase and a 
5 storage phase; 

first resetting means, coupled to said first output 
terminal, for resetting said first output terminal in 
a resetting phase; and 

a first evaluating means, coupled to said first 
10 output terminal, for evaluating at least one signal in 
said evaluation phase; 

wherein said first resetting means and said first 
evaluating means are in inactive states during said 
storage phase", and said first output terminal provides 
15 said evaluated signals during said storage phase. 

2. The apparatus of claim 1, further comprising: 
means, coupled to said first output terminal, for 

sustaining said evaluated signals. 

3. The apparatus of claim 2, wherein said evaluating 
means receives evaluates first states of at least two 
dual-monotonic signals. 

4. The apparatus of claim 3, further comprising: 

a second output terminal, said second output 
terminal operating in said reset, evaluation and said 
storage phases; 
5 second resetting means, coupled to said second 

output terminal, for resetting said second output 
terminal in said resetting phase; and 

a second evaluating means, coupled to said second 
output terminal, for evaluating said second states of 
10 said at least two dual-monotonic signals in said 
evaluation phase; 
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wherein said second resetting means and said second 
evaluating means are in inactive states during said 
storage phase, and said second output terminal sustains 
and provides said evaluated monotonic signals during said 
5 storage. 

5. The apparatus of claim 4, further comprising: 
means, coupled to said second output terminal, for 

sustaining said evaluated signals. 

6. The apparatus of claim 5, 

wherein said first evaluating means includes two 
transistor means, two control terminals of said two 
transistor means receive said first states of two 
5 dual-monotonic signals, and wherein said two transistor 
means are connected in series to said first control 
terminal ; and 

wherein said second evaluating means includes two 
transistor means, two control terminals of said two 
10 transistor means receiving said second states of two 
dual-monotonic signals, and wherein said two transistor 
means are connected in parallel to said second control 
terminal . 

7. A method in use with an apparatus, said apparatus 
including a first output terminal, first resetting means 
coupled to said first output terminal and a first 
evaluating means coupled to said first output terminal, 

5 said method comprising the steps of: 

resetting said first output terminal in a resetting 
phase ; 

evaluating said first terminal in a evaluating 
phase ; 

10 setting said first resetting means and first setting 

in inactive states; and 

sustaining evaluated signals at said first output 
terminal . 
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8. The method of claim 7, wherein said apparatus 
evaluates at least two dual -mono tonic signals and further 
includes a second output terminal, second resetting 
means, and a second evaluating means, wherein said method 

5 further comprises the steps of: 

resetting said second output terminal in a resetting 
phase ; 

evaluating said second terminal in an evaluating 
phase; 

10 setting said second resetting means and second 

setting in inactive states; and 

sustaining evaluated signals at said second output 
terminal . 

9. The method of claim 10, further comprising the steps 
of: 

evaluating first states of two dual-monotonic 
signals at said first output terminal in an AND logic 
5 relation; and 

evaluating second states of two dual-monotonic 
signals at said second output terminal in an OR logic 
relation. 

10. An apparatus, comprising: 

a plurality of function blocks with each of said 
function blocks able to generate a logic output, wherein 
a logic output generated by a trailing function block 
5 provides a logic input of a preceding function block; 

a plurality of detectors, each of said detectors 
being associated with one of said function blocks, for 
generating a completion signal upon detecting that a 
preceding function block has generated a logic output 
10 in responde to a logic output of a trailing function 
block; and 

means for resetting said trailing function block 
in response said completion signal. 
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11. The apparatus of claim 10, further comprising: 
means for generating a final logic output in 

response to logic outputs from said function blocks. 

12. The apparatus of claim 11, further comprising: 
means for generating a final completion signal upon 

detection of said final logic output. 

13. The apparatus of claim 11, wherein each of said 
function blocks is domino circuit. 

14. The apparatus of claim 13, wherein said domino 
circuit receives dual-monotonic signals, said domino 
circuit comprising: 

a first output terminal, said first output terminal 
5 operating in- a reset phase, a evaluation phase and a 
storage phase; 

first resetting means, coupled to said first output 
terminal, for resetting said first output terminal in 
a resetting phase; 
10 a first evaluating means, coupled to said first 

output terminal, for evaluating at least one signal in 
said evaluation phase; 

a second output terminal, said second output 
terminal operating in said reset, evaluation and said 
15 storage phases; 

second resetting means, coupled to said second 
output terminal, for resetting said second output 
terminal in said resetting phase; and 

a second evaluating means, coupled to said second 
20 output terminal, for evaluating said second states of 
said at least two dual-monotonic signals in said 
evaluation phase; 

wherein said first resetting means and said first 
evaluating means are in inactive states during said 
25 storage phase, and said first output terminal sustains 
and provides said evaluated signals during said storage; 



WO 92/07361 



-21- 



PCT/US91/07256 



wherein said second resetting means and said second 
evaluating means are in inactive states during said 
storage phase, and said second output terminal sustains 
and provides said evaluated monotonic signals during said 
5 storage . 

15. The apparatus of claim 14, 

wherein said first evaluating means includes two 
transistor means, two control terminals of said two 
transistor means receive said first sates of two 
5 dual-monotonic signals, and wherein said two transistor 
means are connected to said first control terminal in 
series ; and 

wherein said second evaluating means includes two 
transistor means, two control terminals of said two 
10 ■ transistor means receive said second sates of two 
dual-monotonic signals, and wherein said two transistor 
means are connected to said second control terminal in 
parallel. 

16. The apparatus of claim 11, 

wherein said logic outputs generated by said 
function blocks are provided only in forward fashion, 
and wherein said completion signals are provided only 
5 in backward handshake fashion. 

17. The apparatus of claim 16, wherein connections of 
said function blocks forms a limited path, wherein said 
apparatus further comprising: 

a plurality of control blocks for controlling said 
5 detectors and resetting means; 

wherein said function blocks and control blocks are 
designed so that no control blocks are located at said 
limited path so that raw combinational latencies are 
determined by said function blocks and will not affected 
10 by control blocks. 
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18. The apparatus of claim 17, wherein if any said 
control blocks are located at said limited path, such 
control block can be removed from said limited path by 
altering said function blocks and said control blocks. 

19. The apparatus of claim 18, wherein each of control 
block includes detector and resetting means. 

20. The apparatus of claim 17, wherein said function 
blocks are connected in a ring fashion. 

21. The apparatus of claim 16, further comprising: 
means for merging at least two logic outputs 

generated by at least two trailing function blocks to . 
one data path so that said at least two logic outputs 
5 are provided to one preceding function block. 

22. The apparatus of claim 16, further comprising: 
means for splitting a logic output generated by a 

trailing function block to at least two data paths so 
that said logic output is provided to at least two 
5 preceding function blocks. 

23. A method in use with an apparatus, said apparatus 
including a plurality function blocks with each of said 
function blocks able to generate a logic output, wherein 
a logic output generated by a trailing function block 

5 provides a logic input of a preceding function block, 
said method comprising the steps of: 

detecting completion of generation of logic output 
for each of said block function; 

generating a completion signal upon detecting that 
10 a preceding function block has generated a logic output 
in responde to a logic output of a trailing function 
block; and 

resetting said trailing function block in response 
said completion signal. 
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24. The method of claim 23, further comprising the step 
of: 

generating a final logic output in response to logic 
outputs from said function blocks. 

25. The method of claim 24, further comprising the step 
of: 

generating a final completion signal upon detection 
of said final logic output. 

26. The method of claim 25, wherein connections of said 
function blocks form a limited path, wherein said 
apparatus further including a plurality of control blocks 
for controlling said detectors and resetting means, 

5 wherein said method further comprising the step of: 

arranging control blocks outside of said limited 
path so that raw combinational latencies are determined 
• by said function blocks and will not be affected by said 
control blocks. 

27. The method of claim 26, in case that any said 
control blocks are located at said limited part, further 
comprising the step of: 

granizing said function blocks and said control 
5 blocks so that such control blocks can be removed from 
said limited path. 

28. An apparatus for generating a logic output, 
comprising: 

a plurality of paths, each of said paths generates 
a intermediate in response to plurality of logic inputs; 
5 wherein one of said paths has high probability of 

usage ; and 

wherein said devices in said path have wider conduct 
path to increase speed of said path. 
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29. An method for designing an apparatus for generating 
a logic output, said apparatus includes a plurality of 
paths, and each of said paths generates a intermediate 
in response to plurality of logic inputs, said method 

5 comprising the steps of: 

determining which of the paths has high probability 
of usage; and 

creating wider conducting path for devices in said 
path to increase speed of said path. 

30. An apparatus for generating a pre-determined logic 
output , compr i s ing : 

a plurality of paths, each of said paths generating 
an intermediate signal in response to a plurality of 
5 logic inputs; 

wherein one of said paths has high probability of 
usage, and other of said paths have a lower probability 
of usage; 

wherein said devices in said path are simplified 
10 to shorten logic blocks to increase speed of said path; 
and 

wherein said devices in said some other paths are 
modified to generate equivalence of said pre-determined 
logic output. 

31. A method for designing an apparatus for generating 
a pre-determined logic output, said apparatus includes 
a plurality of paths, each of said paths generates an 
intermediate response to a plurality of logic inputs, 

5 said method comprising: 

determining a path that has high probability of 
usage; 

determining other of said paths that have lower 
probability of usage; 
10 simplifying devices in said path to shorten logic 

blocks; and 
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modifying devices in said some other paths to 
generate equivalence of said pre-determined logic output. 

32. An apparatus for generating a division result, said 
apparatus including a plurality of function blocks, each 
of said function blocks generating a remainder and a 
quotient digit, said remainder and quotient digit 

5 generated by a trailing function block comprising inputs 
for a preceding function block, each of said function 
blocks comprising: 

means for generating a remainder in response to a 
remainder and a quotient digit generated by said trailing 
10 function block; and 

means for generating a quotient digit in response 
to said remainder and said quotient digit generated by 
said trailing function block; 

wherein said remainder and said quotient digits 
15 generated by said trailing function block are provided 
to said remainder generating means and said quotient 
generating means in parallel. 

33. The apparatus of claim 32, wherein said remainder 
generating means further comprising: 

means for providing an original state of said 
quotient digit generated by said trailing function block, 
5 an original state of said quotient digit generated by 
said trailing function block, or a zero state; and 

means for adding a remainder generated by said 
trailing function block and an output from said providing 
means ; 

10 wherein said adding means is capable of saving whole 

bit carries. 
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34. The apparatus of claim 33, wherein said quotient 
digit generating means further comprises: 

first adding means for adding a remainder generated 
by said trailing function block and said original state 
5 of said quotient digit generated by said trailing 
function block, wherein said first adding means is 
capable of saving partial bit carries ; 

second adding means for adding a remainder generated 
by said trailing function block and said opposite state 
10 of said quotient digit generated by said trailing 
function block, wherein said first adding means is 
capable of saving partial bit carries; and 

means for selecting outputs from said first adding 
means, said second adding means and said zero state. 

35. The apparatus of claim 32 or 34, each function block 
operates in a resetting phase, an operation phase and 

. a storage phase, said apparatus further comprising: 

a plurality of detectors, with each of said 
5 detectors associated with one of said function blocks, 
for generating a completion signal upon detecting that 
a preceding function block has generated a remainder and 
a quotient digit in response to a remainder and a 
quotient digit generated by a trailing function block; 
10 and 

means for resetting said trailing function block 
in response said completion signal. 

36. The apparatus of claim 35, 

wherein said remainders and quotient digits 
generated by said function blocks are provided only in 
forward fashion, and wherein said completion signals are 
5 provided only in backward handshake fashion. 

37. The apparatus of claim 36, wherein connections of 
said function blocks form a limited path, wherein said 
apparatus further comprising: 
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a plurality of control blocks for controlling said 
detectors and resetting means; 

wherein said function blocks and control blocks are 
designed so that no control blocks are located at said 
5 limited path so that raw combinational latencies are 
determined by said function blocks and will not be 
affected by said control blocks. 

38. The apparatus of claim 37, wherein if any said 
control blocks are located at said limited part, such 
control block can be removed out of said limited path 
by granizing said function blocks and said control 

5 blocks. 

39. A method used with an apparatus for generating a 
" division result, said apparatus includes a plurality of 

function blocks, each of said function blocks generating 
a remainder and a quotient digit , said remainder and 
5 quotient digit generated by a trailing function block 
serve input for a preceding function block, each of said 
function block comprising the steps of: 

generating a remainder in response to a remainder 
and a quotient digit generated by a trailing function 
10 block; and 

generating a quotient digit in response to said 
remainder and said quotient digit generated by said 
trailing function block; 

wherein said remainder generating step and quotient 
15 digit generating step receives said remainder and 
quotient digit generated by said trailing function block 
in parallel. 

40. The method of claim 39, wherein said remainder 
generating step further comprises the steps of: 

providing original state of said quotient digit 
generated by said trailing function block, original state 
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of said quotient digit generated by said trailing 
function block, or zero etate; and 

adding a remainder generated by said trailing 
function block and an output from said providing means; 
5 wherein said adding step is capable of saving whole 

bit caries. 

40. The method of claim 33, wherein said quotient digit 
generating means step comprises the steps of: 

a first adding step for adding a remainder generated 
by said trailing function block and said original state 
5 of said quotient digit . generated by said trailing 
function block, wherein said first adding means is 
capable of saving partial bit carries; 

a second adding step for adding a remainder 
generated by said trailing function block and said 
10 opposite state of said quotient digit generated by said 
trailing function block, wherein said first adding means 
is capable of saving partial bit carries; and 

selecting outputs from said first adding step, 
second adding step and zero state. 

41. The method of claim 38 or 40, each function block 
operating under a resetting phase, an operation phase 
and a storage phase, said apparatus including a plurality 
of detectors, with each of said detectors associated with 

5 one of said function blocks, said method further 
comprising the steps of: 

generating a completion signal upon detecting that 
a preceding function block has generated a remainder and 
a quotient digit in response to a remainder and a 
10 quotient digit generated by a trailing function block; 
and 

resetting said trailing function block in response 
said completion signal. 
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